Credit Card Users Churn Prediction

Context

Thera Bank recently saw a steep decline in the number of users of its credit card. Credit cards are a good source of income for banks because of the different kinds of fees they charge, such as annual fees, balance transfer fees, cash advance fees, late payment fees, foreign transaction fees, and others. Some fees are charged to every user irrespective of usage, while others are charged under specified circumstances.

Customers leaving the credit card service would cause the bank a loss, so the bank wants to analyze its customer data, identify the customers who are likely to leave the credit card service, and understand their reasons, so that it can improve in those areas.

Objective

To build a classification model that will help the bank improve its services and minimize the number of customers who renounce their credit cards.

Data Dictionary

1. Initial Analysis / Structure of the Data

Import Libraries

Read the dataset

Checking the number of rows and columns

Observation

Viewing the first 5 and last 5 rows of the dataset

Check the datatypes of columns

Observation

Check for Null values / Unique values / Duplicate rows
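The three checks named in this heading can be sketched as follows. This is a minimal illustration on a made-up frame; the column names `Customer_Age` and `Gender` are taken from later sections, and the values are invented.

```python
import pandas as pd

# Toy rows standing in for the bank data (values are made up).
df = pd.DataFrame(
    {
        "Customer_Age": [45, 52, 52, 38],
        "Gender": ["M", "F", "F", None],
    }
)

null_counts = df.isnull().sum()              # missing values per column
unique_counts = df.nunique()                 # distinct non-null values per column
duplicate_rows = int(df.duplicated().sum())  # fully repeated rows

print(null_counts, unique_counts, duplicate_rows, sep="\n")
```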

Observation

Observation

Observation

Checking for unique values in Non-Numeric Columns

Observation

Checking the distribution of the target variable

Observation

Checking the unique values of Categorical Variables

Observation

Summary of DataSet

2. Data Pre-Processing

Converting 'abc' to NaN in Income Category to do missing value imputation later
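A minimal sketch of this conversion, assuming the column is named `Income_Category` and using invented category labels; only the 'abc' placeholder comes from the text above.

```python
import numpy as np
import pandas as pd

# Toy frame; the category labels are hypothetical.
df = pd.DataFrame(
    {"Income_Category": ["Less than $40K", "abc", "$40K - $60K", "abc"]}
)

# Replace the placeholder with a real missing value so that an imputer
# can recognise and fill it later.
df["Income_Category"] = df["Income_Category"].replace("abc", np.nan)

n_missing = int(df["Income_Category"].isna().sum())
print(n_missing)
```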

3. EDA

Univariate Analysis

Observation on Attrition_Flag (target variable)

Observation

Observation on Gender

Observation

Observation on Education Level

Observation

Observation on Marital Status

Observation

Observation on Income Category

Observation

Observation on Card_Category

Observation

Observation on Customer Age

Observation

Observation on Dependent Count

Observation

Observation on Credit Limit

Observation

Observation on Total_Revolving_Bal

Observation

Observation on Total_Trans_Amt

Observation

Observation on Total_Transaction_Count / Capping the values
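The capping rule is not spelled out here, so the following is only one plausible sketch: it assumes values are clipped at the usual boxplot whiskers (1.5 times the IQR beyond the quartiles). The helper name `cap_outliers_iqr` and the sample values are hypothetical.

```python
import pandas as pd


def cap_outliers_iqr(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Clip values lying more than k * IQR outside the quartiles."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)


# Made-up transaction counts with one extreme value.
counts = pd.Series([10, 12, 14, 15, 16, 18, 20, 139])
capped = cap_outliers_iqr(counts)
print(capped.tolist())
```

Capping keeps the row in the dataset while limiting the influence of the extreme value, which is why it is sometimes preferred over dropping outliers.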

Observation

Observation

Observation on Total_Ct_Chng_Q4_Q1 / Capping the values

Observation

Observation on Total_Amt_Chng_Q4_Q1 / Capping the values

Observation

Observation on Avg_Utilization_Ratio

Observation

Bivariate Analysis

Correlation Plot

Observation

Observation on Customer Age vs Attrition Flag

Observation

Observation on Gender vs Attrition Flag

Observation

Observation on Education_Level vs Attrition_Flag

Observation

Observation on Marital Status vs Attrition_Flag

Observation

Observation on Income Category vs Attrition_Flag

Observation

Observation on Card Category vs Attrition_Flag

Observation

Observation on Total_Relationship_count vs Attrition_Flag

Observation

Observation on Months_Inactive_12_mon vs Attrition_Flag

Observation

Observation on Age vs Attrition_Flag in Distribution Plot

Observation

Observation on Credit_Limit vs Attrition_Flag in Distribution Plot

Observation

Observation on Total_Trans_Amt vs Attrition_Flag in Distribution Plot

Observation

Observation on Total_Trans_Amt, Credit_Limit, Total_Revolving_Bal, and Total_Ct_Chng_Q4_Q1 against Attrition_Flag as boxplots

Observation

Summary of EDA

There is a perfect correlation (1.0) between Credit_Limit and Avg_Open_to_Buy. There is a high correlation (0.8) between Total_Trans_Amt and Total_Trans_Ct. There is a high correlation (0.79) between Months_on_book and Customer_Age. There is some correlation (0.62) between Avg_Open_to_Buy and Total_Revolving_Bal. Due to these correlations, the following columns are dropped from the dataset: Avg_Open_to_Buy, Total_Trans_Ct, Months_on_book, and Avg_Utilization_Ratio.
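Dropping one column from each highly correlated pair can be sketched as below. The data is synthetic and the 0.75 threshold is an assumption chosen for illustration; the columns listed above were selected from the actual correlation plot.

```python
import numpy as np
import pandas as pd

# Synthetic data: Avg_Open_to_Buy is built to correlate perfectly
# with Credit_Limit, Total_Revolving_Bal is independent noise.
rng = np.random.default_rng(0)
limit = rng.uniform(1500, 35000, size=200)
df = pd.DataFrame(
    {
        "Credit_Limit": limit,
        "Avg_Open_to_Buy": limit - 500.0,
        "Total_Revolving_Bal": rng.uniform(0, 2500, size=200),
    }
)

corr = df.corr().abs()
# Keep only the upper triangle so each pair is inspected once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
# Flag the second column of every pair above the (assumed) threshold.
to_drop = [col for col in upper.columns if (upper[col] > 0.75).any()]
reduced = df.drop(columns=to_drop)
print(to_drop)
```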

Customer attrition was highest at ages 68, 66, and 59.

There is no noticeable difference in attrition between male and female customers, though slightly more female customers renounce their credit cards.

Graduates make up the majority of existing customers. At the same time, this class has the highest attrition as well. High School educated customers are the next largest group of existing customers. Attrition is comparatively lower in the Uneducated and Post-Graduate classes.

Married customers form the majority of existing customers. Attrition is almost the same among Married and Single customers. Attrition is lowest among Divorced customers.

The majority of existing customers fall under income levels of less than 40K. Attrition is also highest in this category. The next largest category of customers is 40-60K.

The majority of existing customers have subscribed to the Blue credit card. The next largest category of existing customers is Silver. Gold and Platinum cards have very few subscribers.

The majority of existing customers, as well as attrited customers, hold around 3 banking products. Attrition increases as the number of products held goes from 1 to 3, and decreases as customers hold more products (up to 6).

Most customers have been inactive for 1-3 months (in a 12-month period). Attrition also gets higher around 3 months of inactivity.

Credit limits for the majority of customers are around 1500 and under 5000. This segment has many outliers. A few customers have been granted extreme credit limits of around 35000.

The range of Total_Trans_Amt is comparatively lower for attrited customers. Credit_Limit values are also relatively lower for attrited customers (with outliers). Total_Revolving_Bal is lower for attrited customers than for existing customers. Total_Ct_Chng_Q4_Q1 values are also lower for attrited customers (with outliers).

Feature Engineering

Missing value treatment

We will use KNN imputer to impute missing values.
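A minimal sketch of KNN imputation on a categorical column, assuming the categories are first mapped to numbers and mapped back afterwards. The mapping, the labels "Low" and "High", and the toy rows are hypothetical; `KNNImputer` is scikit-learn's imputer.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.DataFrame(
    {
        "Customer_Age": [45, 46, 30, 31, 44],
        "Income_Category": ["High", "High", "Low", "Low", np.nan],
    }
)

# KNNImputer works on numbers, so encode the categories first.
income_map = {"Low": 0, "High": 1}
inverse_map = {code: label for label, code in income_map.items()}
df["Income_Category"] = df["Income_Category"].map(income_map)

# Each missing value is filled with the mean of its 2 nearest rows.
imputer = KNNImputer(n_neighbors=2)
imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

# Round the averaged code to a valid class, then invert the mapping.
imputed["Income_Category"] = (
    imputed["Income_Category"].round().astype(int).map(inverse_map)
)
print(imputed)
```

In a train/test workflow the imputer would normally be fitted on the training split only and then applied to the other splits, to avoid leaking information.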

Data Preparation for Modeling

Observation

Verifying that the data is converted back with the inverse mapping

Creating Dummy Variables
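Dummy variables can be created with `pandas.get_dummies`, as in the sketch below. The toy rows are made up, and `drop_first=True` is an assumption, used here to avoid a redundant column per category.

```python
import pandas as pd

X = pd.DataFrame(
    {
        "Gender": ["M", "F", "F"],
        "Card_Category": ["Blue", "Silver", "Blue"],
        "Credit_Limit": [1500.0, 5000.0, 3200.0],
    }
)

# One indicator column per category level, minus the first level.
X_dummies = pd.get_dummies(
    X, columns=["Gender", "Card_Category"], drop_first=True
)
print(list(X_dummies.columns))
```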

Building the Model

Model Evaluation Criterion

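The model can make wrong predictions

Important case: losing a customer is the more costly error, so the model should flag such customers in time for the bank to take the necessary actions to retain them.

How to reduce the loss? The bank would want recall to be maximized: the higher the recall, the lower the chance of false negatives, that is, of predicting that a customer will stay when they actually leave.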
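To make the criterion concrete, recall is TP / (TP + FN): the share of customers who actually left that the model caught. The sketch below uses made-up labels and assumes attrited customers are encoded as 1.

```python
from sklearn.metrics import confusion_matrix, recall_score

# 1 = attrited, 0 = existing (assumed encoding; labels are made up).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]

# For binary labels the flattened matrix is ordered tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
recall = recall_score(y_true, y_pred)

print(tp, fn, recall)
```

Here one attrited customer out of four is missed (a false negative), so recall is 3 / 4.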

Observation

Oversampling Training Data using SMOTE
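SMOTE creates synthetic minority rows by interpolating between a minority point and one of its nearest minority neighbours. The sketch below is a hand-rolled illustration of that idea in NumPy, not the library implementation; the function name `smote_like_oversample` and the sample points are hypothetical. In practice one would more likely use a library class such as `SMOTE` from imbalanced-learn.

```python
import numpy as np


def smote_like_oversample(X_min, n_new, k=3, seed=0):
    """Create n_new synthetic rows by interpolating between a minority
    point and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    # Pairwise Euclidean distances within the minority class.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    base = rng.integers(0, len(X_min), size=n_new)
    picked = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))  # position along the connecting segment
    return X_min[base] + gap * (X_min[picked] - X_min[base])


X_minority = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0], [2.5, 2.5]])
synthetic = smote_like_oversample(X_minority, n_new=6)
print(synthetic.shape)
```

Oversampling is applied to the training data only, so that validation and test scores reflect the original class balance.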

Creating various models on Oversampled Train Data

Observation

Undersampling Training data
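Random undersampling keeps every minority row and samples an equal number of majority rows. The sketch below does this with plain pandas on made-up data; it assumes attrited customers are encoded as 1.

```python
import pandas as pd

# Made-up training frame: 2 attrited rows, 8 existing rows.
train = pd.DataFrame(
    {
        "Total_Trans_Amt": range(10),
        "Attrition_Flag": [1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
    }
)

minority = train[train["Attrition_Flag"] == 1]
majority = train[train["Attrition_Flag"] == 0]

# Sample as many majority rows as there are minority rows, then shuffle.
majority_down = majority.sample(n=len(minority), random_state=1)
train_under = pd.concat([minority, majority_down]).sample(frac=1, random_state=1)

class_counts = train_under["Attrition_Flag"].value_counts().to_dict()
print(class_counts)
```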

Creating various models on UnderSampled Train Data

Observation

Choose the 3 best performing models so that they can be tuned further using hyperparameter tuning

As per the algorithm comparison box plots, the following models performed best among the 6 models:

1) RandomForest

2) GradientBoost

3) Logistic Regression

RandomForest

Observation

Observation

Gradient Boost

Observation

GradientBoosting Model (Hyperparameter Tuned with Randomized Search)
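A randomized search that optimizes recall could look like the sketch below. The synthetic data, the parameter grid, and the values of `n_iter` and `cv` are all assumptions for illustration, not the settings used in this analysis.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic imbalanced data standing in for the prepared training set.
X, y = make_classification(
    n_samples=300, n_features=8, weights=[0.8, 0.2], random_state=1
)

# Hypothetical search space.
param_distributions = {
    "n_estimators": [50, 100, 150],
    "learning_rate": [0.01, 0.05, 0.1, 0.2],
    "max_depth": [2, 3, 4],
    "subsample": [0.7, 0.9, 1.0],
}

search = RandomizedSearchCV(
    estimator=GradientBoostingClassifier(random_state=1),
    param_distributions=param_distributions,
    n_iter=5,          # number of random parameter combinations to try
    scoring="recall",  # matches the evaluation criterion chosen earlier
    cv=3,
    random_state=1,
)
search.fit(X, y)

best_params = search.best_params_
best_recall = search.best_score_
print(best_params, best_recall)
```

Randomized search samples a fixed number of combinations instead of trying all of them, which keeps tuning time bounded as the grid grows.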

Observation

Logistic Regression

Observation

Logistic Regression Model (Hyperparameter Tuned with Randomized Search)

Observation

Comparing Models

Observation

Performance on Test Set

Observation

Identifying the most important features

Observation

Productionalize the model using Pipelines

We will create 2 different pipelines, one for numerical columns and one for categorical columns. For numerical columns, we will do missing value imputation as pre-processing. For categorical columns, we will do one hot encoding and missing value imputation as pre-processing.

We are doing missing value imputation for the whole data, so that any missing values that appear in future data are taken care of.
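The two pipelines described above can be combined with scikit-learn's `ColumnTransformer`, as in the sketch below. The column lists, the imputation strategies, the toy rows, and the choice of `GradientBoostingClassifier` as the final estimator are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

numeric_cols = ["Credit_Limit", "Total_Trans_Amt"]
categorical_cols = ["Gender", "Card_Category"]

# Numerical columns: impute missing values.
numeric_pipe = Pipeline([("imputer", SimpleImputer(strategy="median"))])

# Categorical columns: impute first, then one hot encode.
categorical_pipe = Pipeline(
    [
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]
)

preprocessor = ColumnTransformer(
    [
        ("num", numeric_pipe, numeric_cols),
        ("cat", categorical_pipe, categorical_cols),
    ]
)

model = Pipeline(
    [
        ("pre", preprocessor),
        ("clf", GradientBoostingClassifier(random_state=1)),
    ]
)

# Made-up rows with gaps in every column.
X = pd.DataFrame(
    {
        "Credit_Limit": [1500.0, np.nan, 12000.0, 3000.0, 34000.0, 2500.0],
        "Total_Trans_Amt": [900.0, 4200.0, np.nan, 1100.0, 8000.0, 1000.0],
        "Gender": ["F", "M", np.nan, "F", "M", "F"],
        "Card_Category": ["Blue", "Blue", "Silver", np.nan, "Gold", "Blue"],
    }
)
y = [1, 0, 0, 1, 0, 1]

model.fit(X, y)
predictions = model.predict(X)
print(predictions)
```

Because the imputers live inside the pipeline, the same fitted object handles missing values in any future data passed to `predict`.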

Business Insights and Recommendations